The null hypothesis is that models A and B each have a 1/2 chance of winning whenever their outcomes on a problem differ; ties are excluded. The p-value is the probability, under the null hypothesis, of observing a difference at least as extreme as the one we measured. Hover over each entry to display the information used to compute its p-value.
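Concretely, this is a two-sided binomial (sign) test on the non-tied problems. A minimal sketch using SciPy, with hypothetical win counts, looks like this:

```python
from scipy.stats import binomtest

# Hypothetical head-to-head counts: problems where exactly one of the
# two models succeeds (ties are dropped before the test).
wins_a, wins_b = 60, 40

# Under the null, each non-tied problem is a fair coin flip, so the
# two-sided binomial (sign) test gives the probability of a split at
# least as lopsided as the observed one.
p_value = binomtest(wins_a, n=wins_a + wins_b, p=0.5, alternative="two-sided").pvalue
print(f"p = {p_value:.3f}")  # ~0.057 for a 60/40 split
```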
We can also look at how the p-value varies with the difference in accuracy across model pairs. Hover over a point to display the model pair it corresponds to.
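To generate such points, one can run the same test over every model pair and record each pair's accuracy gap alongside its p-value. The sketch below does this on made-up per-problem pass/fail vectors; the model names and accuracies are illustrative only:

```python
import itertools
import numpy as np
from scipy.stats import binomtest

# Hypothetical per-problem pass/fail vectors standing in for real
# benchmark results.
rng = np.random.default_rng(0)
results = {name: rng.random(200) < acc
           for name, acc in [("strong", 0.8), ("medium", 0.7), ("weak", 0.5)]}

points = []  # one (accuracy gap, p-value, pair) entry per model pair
for a, b in itertools.combinations(results, 2):
    ra, rb = results[a], results[b]
    wins_a = int(np.sum(ra & ~rb))  # a solves it, b does not
    wins_b = int(np.sum(~ra & rb))  # b solves it, a does not
    n = wins_a + wins_b
    p = binomtest(wins_a, n=n).pvalue if n else 1.0
    points.append((abs(ra.mean() - rb.mean()), p, (a, b)))
```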
Following Chatbot Arena, we show the head-to-head comparisons between all pairs of models, reporting wins and two types of ties.
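For code benchmarks, a natural reading of the two tie types, analogous to Arena's "tie" and "tie (both bad)", is "both models solve the problem" and "both fail". Assuming that interpretation, the per-pair tally might look like the following sketch (pass/fail vectors are made up):

```python
import itertools
import numpy as np

# Hypothetical per-problem pass/fail vectors.
rng = np.random.default_rng(0)
results = {name: rng.random(100) < acc for name, acc in [("a", 0.7), ("b", 0.5)]}

for m1, m2 in itertools.combinations(results, 2):
    r1, r2 = results[m1], results[m2]
    print(m1, "vs", m2, {
        f"{m1} wins": int(np.sum(r1 & ~r2)),
        f"{m2} wins": int(np.sum(~r1 & r2)),
        "tie (both solve)": int(np.sum(r1 & r2)),
        "tie (both fail)": int(np.sum(~r1 & ~r2)),
    })
```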
We show three methods currently used for evaluating code models: raw accuracy (pass@1, as reported by benchmarks), average win rate against all other models, and Elo (technically Bradley-Terry coefficients, following Chatbot Arena); a sketch of the Bradley-Terry fit appears after the table. The three rankings usually correlate near-perfectly.
| # | model | pass@1 | win_rate | Elo |
|---|---|---|---|---|
| 0 | gpt-4-turbo-2024-04-09+cot | 0.820 | 0.936 | 1544.180 |
| 1 | gpt-4-0613+cot | 0.771 | 0.928 | 1519.130 |
| 2 | claude-3-opus-20240229+cot | 0.820 | 0.876 | 1404.876 |
| 3 | gpt-3.5-turbo-0613+cot | 0.590 | 0.823 | 1323.343 |
| 4 | gpt-4-0613 | 0.687 | 0.762 | 1235.863 |
| 5 | codellama-34b+cot | 0.436 | 0.752 | 1225.902 |
| 6 | gpt-4-turbo-2024-04-09 | 0.677 | 0.744 | 1215.286 |
| 7 | claude-3-opus-20240229 | 0.657 | 0.697 | 1163.230 |
| 8 | codellama-13b+cot | 0.360 | 0.694 | 1162.080 |
| 9 | codellama-7b+cot | 0.299 | 0.601 | 1070.951 |
| 10 | deepseek-base-33b | 0.486 | 0.547 | 1019.993 |
| 11 | deepseek-instruct-33b | 0.499 | 0.546 | 1019.080 |
| 12 | gpt-3.5-turbo-0613 | 0.494 | 0.523 | 1000.000 |
| 13 | codetulu-2-34b | 0.458 | 0.511 | 987.470 |
| 14 | deepseek-base-6.7b | 0.435 | 0.497 | 976.538 |
| 15 | magicoder-ds-7b | 0.444 | 0.479 | 960.910 |
| 16 | codellama-34b | 0.424 | 0.461 | 945.742 |
| 17 | mixtral-8x7b | 0.405 | 0.460 | 946.291 |
| 18 | codellama-13b | 0.397 | 0.439 | 927.468 |
| 19 | wizard-34b | 0.434 | 0.414 | 906.376 |
| 20 | wizard-13b | 0.413 | 0.411 | 904.437 |
| 21 | codellama-python-34b | 0.414 | 0.402 | 897.732 |
| 22 | codellama-python-13b | 0.398 | 0.398 | 893.673 |
| 23 | deepseek-instruct-6.7b | 0.412 | 0.373 | 873.682 |
| 24 | phind | 0.397 | 0.362 | 864.398 |
| 25 | phi-2 | 0.335 | 0.346 | 849.806 |
| 26 | codellama-python-7b | 0.359 | 0.344 | 848.583 |
| 27 | mistral-7b | 0.343 | 0.329 | 833.922 |
| 28 | codellama-7b | 0.342 | 0.327 | 833.731 |
| 29 | starcoderbase-16b | 0.342 | 0.323 | 828.742 |
| 30 | deepseek-base-1.3b | 0.310 | 0.284 | 794.441 |
| 31 | starcoderbase-7b | 0.322 | 0.271 | 782.240 |
| 32 | phi-1.5 | 0.275 | 0.257 | 768.074 |
| 33 | deepseek-instruct-1.3b | 0.287 | 0.229 | 740.721 |
| 34 | phi-1 | 0.217 | 0.154 | 654.672 |
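For reference, the Elo column can be reproduced in spirit with Chatbot Arena's logistic-regression formulation of Bradley-Terry. The sketch below uses made-up battles; the 400/ln 10 rescaling and the 1000-point anchor (the table pins gpt-3.5-turbo-0613 at exactly 1000.000, which suggests a similar anchor) are assumptions, not the exact pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up battle log: (model_1, model_2, did model_1 win?); ties omitted.
models = ["a", "b", "c"]
battles = [("a", "b", 1), ("a", "b", 0), ("a", "c", 1),
           ("b", "c", 1), ("a", "c", 0), ("b", "c", 1)]

idx = {m: i for i, m in enumerate(models)}
X = np.zeros((len(battles), len(models)))
y = np.array([won for _, _, won in battles], dtype=float)
for row, (m1, m2, _) in enumerate(battles):
    X[row, idx[m1]], X[row, idx[m2]] = 1.0, -1.0  # +1 for m1, -1 for m2

# Unpenalized logistic regression recovers the Bradley-Terry
# log-strengths (penalty=None needs scikit-learn >= 1.2).
lr = LogisticRegression(fit_intercept=False, penalty=None).fit(X, y)
ratings = 400.0 * lr.coef_[0] / np.log(10)  # rescale to Elo-like points
ratings += 1000.0 - ratings[idx["c"]]       # anchor one model at 1000
print(dict(zip(models, ratings.round(1))))
```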